F/6  12/1 


AO-A101  92*  PURDUE  UNIV  LAFAYETTE  IN  E)EPT  OF  STATISTICS 
ON  ELIMINATING  INFERIOR  REGRESSION  MODELS. (U) 

JUL  61  D  HUANG*  S  PANCHAPAKESAN  N0001*-75-C-0*55 

UNCLASSIFIED  MNS-81-25  NL 


ADA101924 


VDIB^ 


PURDUE  UNIVERSITY 


81  7  21  Q24  J 


_£N  ELIMINATING  JNFERIOR  REGRESSION  MODELS** 

by 

J  )  Deng- Yuan i/fluang  S./Panchapakesan 

Institute  of  Mathematics  Department  of  Mathematics 
Academia  Sinica  Southern  Illinois  University 

Taipei,  Taiwan  Carbondale,  Illinois 


i!j)  Mimeograph  $eries^#81-25 


Department  of  Statistics 
Division  of  Mathematical  Sciences 
Mimeograph  Series  #81-25 

//  July  1981 


*This  research  was  supported  by  the  Office  of  Naval  Research  contract 
/  )  NC(j^iDl4-75-C-j3455  at  Purdue  University.  Reproduction  in  whole  or  in 
'  v/  part  i-&  permitted  for  any  purpose  of  the  United  States  Government. 


ON  ELIMINATING  INFERIOR  REGRESSION  MODELS* 


Deng-Yuan  Huang 
Institute  of  Mathematics 
Academia  Sinica 
Taipei,  Taiwan 


S.  Panchapakesan 
Department  of  Mathematics 
Southern  Illinois  University 
Carbondale,  Illinois 


Key  Words  and  Phrases:  Linear  regression  models;  eliminating  inferior 
models;  multiple  correlation;  guaranteed  probability  of  correct 
decision. 

AMS  1980  subject  classifications:  Primary  62F07;  secondary  62J05 


ABSTRACT 

Consider  a  linear  regression  model  with  (p-1)  predictor  variables 
which  is  taken  as  the  "true"  model.  The  goal  is  to  select  a  subset  of  all 
possible  reduced  models  such  that  all  inferior  models  (to  be  defined)  are 
excluded  with  a  guaranteed  minimum  probability.  A  procedure  is  proposed 
for  which  the  exact  evaluation  of  the  probability  of  a  correct  decision  is 
difficult;  however,  it  is  shown  that  the  probability  requirement  can  be  met 
for  sufficiently  large  sample  size.  Monte  Carlo  evaluation  of  the  constant 
associated  with  the  procedure  and  some  ways  to  reduce  the  amount  of 
computations  involved  in  the  implementation  of  the  procedure  are  discussed. 


1.  INTRODUCTION 

A  problem  of  great  interest  to  many  practitioners  of  linear  regression 
analysis  is  that  of  selecting  an  appropriate  subset  of  the  predictor 
variables  which  adequately  describe  the  variance  of  the  response  variable. 

Some  of  the  commonly  employed  techniques  are  all  possible  regressions,  forward 
selection,  backward  selection  and  stepwise  procedures.  These  procedures  along 
with  some  variations  and  computational  methods  are  given  in  Draper  and  Smith 
(1966).  Several  criteria  for  defining  the  best  set  of  predictor  variables  and 
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various  techniques  for  selecting  the  best  set  have  been  discussed  in  a 
nice  expository  paper  by  Hocking  (1976).  A  brief  review  of  these  methods 
is  also  given  by  Thompson  (1978).  However,  these  techniques  do  not  have 
an  accompanying  probability  guarantee  for  selecting  the  best  set;  moreover, 
the  measures  of  goodness  of  models  are  based  on  the  data.  Formulation  of 
this  problem  in  the  framework  of  the  multiple  decision  subset  selection 
procedures  of  Gupta  (1956,  1965)  has  been  recently  considered  by  Arvesen 
and  McCabe  (1975),  Gupta  and  Huang  (1977),  and  McCabe  and  Arvesen  (1974). 

We  adopt  the  same  framework  here.  For  details  of  the  general  subset 
selection  theory,  see  Gupta  and  Panchapakesan  (1979). 

Formulation  of  the  problem  is  given  in  Section  2.  In  the  next  section, 
a  procedure  is  proposed  and  the  infimum  of  the  probability  of  a  correct 
decision  (PCD)  is  expressed  in  terms  of  the  models  obtained  by  dropping  one 
predictor  variable  at  a  time.  Section  3  discusses  asymptotic  results  and 
establishes  the  (asymptotic)  least  favorable  configuration  for  PCD.  However, 
this  still  does  not  make  the  calculation  of  the  necessary  constant  easy.  The 
next  section  describes  a  Monte  Carlo  method  of  determining  the  constant.  A 
few  facts  which  can  be  effectively  used  in  reducing  the  amount  of  computations 
needed  in  implementing  the  procedure  are  discussed  in  Section  5. 

2.  FORMULATION  OF  THE  PROBLEM 
Consider  the  model 

Yj  =  B0  +  elxjl  +,"+  Bp-lxj,p-l  +  j  =  (2,1) 

where  x.  ,  j  =  1,...,N,  are  fixed  levels  of  the  predictor  variables 

J  5  * 

i»  the  are  unknown  parameters,  and  the  6j  are  independent 

2 

normal  random  variables  with  mean  zero  and  variance  Og.  Let 

Y'  =  ( Y-j » •  •  •  >Y|^) ,  $'  =  (Bg, .  • . » Bp_ i ) ,  =  (€•],...  ,6^) , 

1'  =  (1,...,1),  and  x^  =  (xli,...,xNi),  i  =  l,...,p-l.  Then  the  model 
(2.1)  can  be  written  in  the  familiar  matrix  form 

Y  =  XB  +  €  (2.2) 


where  X  =  (1  x-j  ...  x  i )  and  the  rank  of  X  is  assumed  to  be  p  <  N. 

It  is  assumed  that  (2.2)  represents  the  "true"  model.  We  wish  to  compare 


riTv  '****•;  ?™«**v**- 
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with  this  true  model  all  the  models  that  can  be  obtained  by  taking  only 
some  of  the  predictor  variables.  In  order  to  define  inferior  models,  we 
need  a  measure  of  goodness  of  a  model.  For  any  fixed  a  =  0,l,...,p-l, 
consider  all  the  (p’^)  subsets  of  the  set  of  predictor  variables 
{x-| ,. . .  ,Xp_i }  and  the  corresponding  reduced  models  obtained  from  (2.2). 
Associated  with  these  reduced  models  are  the  multiple  correlation  coeffi¬ 
cients  R^  ,  i  =  1,2,...,(p‘^).  The  indexing  of  these  reduced  models 

2 

can  be  done  in  an  arbitrary  manner.  Let  0.  =  E(l-R.  ).  Then  the 

J  1  ,a  1  ,a 

goodness  of  a  reduced  model  is  defined  by  comparing  0.  for  the  model 

I  jOl 

with  the  parameter  6^  of  the  true  (full)  model. 


Definition  2.1 .  A  reduced  model  whose  associated  parameter  e.  a  is 

said  to  be  inferior  if  e-|  p_-|  6*  where  6*  6  (0,1)  is  a  specified 

constant. 

It  is  to  be  noted  that  comparison  of  models  based  on  0^  is  equiva  ent 
to  that  based  on  the  expected  residual  sums  of  squares  in  the  ANOVA  of  these 
models.  However,  it  is  more  practical  to  fix  6*  in  relation  to  multiple 
correlation  coefficients  as  they  are  unit-free. 

The  true  model  is,  of  course,  the  best  model.  While  eliminating  the 
inferior  models,  we  do  not  want  to  overly  reject  good  models.  Formally  stated, 
our  goal  is:  Select  e  subset  of  all  possible  models  with  preferably  a  large 
subset  size  so  that  all  the  inferior  models  are  excluded  from  the  selected 
subset  of  models  with  a  guaranteed  minimum  probability  P*  (0  <  P*  <  1). 


Definition  2.2.  Model  A  is  said  to  be  a  submodel  of  Model  B  if  the  set 
of  predictor  variables  of  A  is  a  subset  of  that  of  B. 

Since  the  multiple  correlation  coefficient  for  a  model  cannot  be  smaller 
than  the  coefficient  for  any  of  its  submodels,  we  make  the  following  remark. 


Remark  2,1.  If  any  model  is  inferior,  then  all  its  submodels  are 
inferior. 

The  number  of  inferior  models,  t^,  is  unknown.  Of  course,  0  ^  t-j  <  t-1, 


where  t  =  2p-1-l.  Let  P(t-|)  denote  the  set  of  all  parametric  configurations 

that  give  rise  to  exactly  t.,  inferior  models  and  let  n  =  u 
- -  t1 


aUp, 
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We  now  propose  a  procedure  based  on  the  sample  multiple  correlation 

coefficients  for  the  different  models.  Let  R.  denote  the  samplecoefficient 

^  ,a  „  o 

corresponding  to  the  model  associated  with  0.  and  set  0,.  =  1  -  Rv  . 

1  ,a  1 ,a  1  ,a 

3.  PROCEDURE  ft 

The  proposed  procedure  A  is:  exclude  from  the  selected  subset  any 
model  for  which 


e 


l.P-l 


(3.1) 


where  the  constant  c  =  c(N,p,P*)  >  6*  is  determined  such  that  P(CD|ft),  the 
probability  of  a  correct  decision  using  ft,  satisfies  the  inequality 

P(CD|ft)>P*.  (3.2) 

We  first  note  that  if  any  model  is  excluded  by  a  ,  then  all  its 
submodels  are  also  excluded.  Further,  we  need  only  to  determine  the  ratio 
c/6*  =  d  (say). 

For  a  parametric  configuration  in  fl(t-j), 

P(CD|ft)  >_  Pr{0.j  2  >  de ]  i ,  i  =  1 , . . .  ,p-l } .  (3.3) 

The  above  inequality  is  obvious  because  of  Remark  2.1  and  the  fact  that 
the  right-hand  side  of  (3.3)  is  the  probability  of  a  correct  decision  when 
t-j  =  t-1.  Consequently, 


inf  P(CD|a)  =  inf  Pr{e.  , 

a  n  1,p 


>  d  0 


l,p-l’ 


1,...,P-1>. 


(3.4) 


Let  denote  the  matrix  obtained  from  X  by  deleting  the  column 
vector  x.j ,  and  denote  the  vector  obtained  from  §  by  leaving  out. 
Consider  the  (p-1)  reduced  models  given  by 


Y  = 

where  6^  *  N(0,  o^ln). 


i  =  1 , . . .  ,p-l ,  (3.5) 

It  should  be  noted  that  in  stating  the  reduced  model 
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(3.5) ,  we  mean  that  the  model  is  used  for  prediction  purposes  using  only  the 

(p-1)  variables  of  X^y  However,  our  comparisons  of  models  are  made  under 

the  true  model  assumptions.  The  expectation  of  the  residual  mean  square  in 

2 

the  corresponding  ANOVA  evaluated  under  the  true  model  is  given  by  the 
result  (e)  at  the  end  of  this  section.  The  reduced  model  described  in 

(3.5)  reflects  this  fact. 

Now,  let  SS.j  denote  the  residual  sum  of  squares  in  the  ANOVA  correspond¬ 
ing  to  the  model  with  and  let  SSg  denote  the  residual  sum  of  squares  in 

the  ANOVA  of  the  full  model.  Then  we  can  summarize  our  discussion  thus  far 
in  the  following  theorem. 

Theorem  3.1.  For  the  procedure  £  defined  in  (3.1), 

inf  P(CD|ft)  >  inf  Pr{SS.  >_  d  SSQ,  i  =  l,...,p-l).  (3.6) 

a  si  1  u 

Exact  evaluation  of  the  infimum  on  the  right-hand  side  of  (3.6)  is 
difficult.  We  take  recourse  to  asymptotic  theory  and  try  to  achieve  the 
probability  requirement  in  (3.2)  for  large  N  in  the  next  section.  We  state 
below  a  few  well-known  results  in  regression  theory  which  we  need. 

(a)  SSi  =  Y'{ 1  -  x(i)(x(i)  x(i))'1x(i)}?  =  I'Qi!*  saV- 

(b)  SSQ  =  Y* { 1  -  X(X*X)"1X,}Y  =  Y'Q0Y,  say, 

(c)  SS.j/0g  -  x2(v,  (X§)  'Q.(X§)/2oq)  (under  the  true  model) 

2 

where  v  s  N  -  p+1  and  x  (v,  x)  denotes  the  noncentral  chi-square 
distribution  with  v  degrees  of  freedom  and  noncentrality  parameter  x. 

2  2 

(d)  SSq/oq  -  x  (v0),  the  (central)  chi-square  distribution  with 
vg  =  N-p  degrees  of  freedom. 

(e)  a2  =  ag  +  (X§) 'Q^ (X§)/v. 

4.  ASYMPTOTIC  RESULTS 

2 

Since  our  rule  is  invariant  with  respect  to  og  >  0,  we  can  assume  that 
og  =  1.  Following  Arvesen  and  McCabe  (1975),  we  write  Y ' Y  =  UlU^, 
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i  =  0,l,...,p-l,  where  Ui  =  Y  with  =  1  and  B^  =  Qi .  Here  Bi  is 

a  v  x  n  matrix  for  i  =  l,...,p-l  and  BQ  is  a  vg  x  n  matrix.  The  joint 

distribution  of  U'  =  (Uq,  UJ . . ,Up_-j )  is  multivariate  normal  in  vq  +  (p-l)v 

dimensions  with  mean  vector  n'  =  {r\’Q,  nj..-.»np_j)  with  =  B^p, 

i  =  0,1 ... . ,p-l ,  and  covariance  matrix  z  =  ( e .  • )  where  z.  ■  =  B.B^.  Note 

that  z  is  possibly  singular. 

Now,  letting 

=  (SSi  -  v  -  )/ /2v  +  4n ^  ,  i  =  l,...,p-l, 

(4.1 

Z0  =  <sso  ~  ’ 

we  have 

Pr{SSj  >  d  SSQ,  i  =  l,...,p-l} 


=  Pr{- 


SSi 


SSr 


/2v  +  /2vg 


/ 


2v, 


2v  +  4nT  n.j 


i  =  1,. 


,p-l) 


Pr{- 


SSi 


/2Z  +  4n! 


Qt  3i 


!!o_ 


r_o_ 

V 


i  =  1 ,. . .  ,p-l} 


=  Pr{Zi 


+  v  +  aj  Qj 

/2v  +  4n]  n.j 


i  =  l,...,p-l} 


The  last  inequality  follows  from  the  fact  that  >_  0  and 

<t> ( t )  *  (v  +  t)//v  +  2t  is  strictly  increasing  in  t  >_  0.  Thus  we  have 
Pr  SSi  >  d  SS0,  i  =  1 _ _ ,p-l } 

>  Pr(Zi  >  d/^  ZQ  +  ^  -  f\  ,  i  -  1 . p-11.  (4.2) 

/2v 
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It  can  be  easily  seen  that  we  have  equality  in  (4.2)  when  n-  =  0.  Also, 

the  joint  distribution  of  Zg,  ^  does  not  depend  upon  ng, 

for  large  N.  This  shows  that  the  worst  configuration  (asymptotically)  for 
Pr{SS.j  >  d  SSg,  i  =  l,...,p-l}  is  when  §  =  0.  Thus  we  can  achieve  the 
probability  requirement  (3.2)  for  large  N  if  d  is  determined  by 

Pr{Z.  >  d/^  Zn  +  — ^  -  Jl  ,  i  =  =  P*  (4.3) 

1  /  v  0  /2^7  V  2 

where  Z’  =  (Zg,Z-| , . . .  >Zp_-| )  has  a  multivariate  normal  distribution  with  mean 
zero  and  covariance  matrix  r  =  (p-jj)  with 


and 


pij  =  v  tr(EijSji)  =  v  tr(QjQi}’  1  *  j’ 


p0j  “  y— —  tr(s0jz j0^  "  r^~  j  = 

^0  /vvo 


5.  EVALUATION  OF  THE  CONSTANT 


1 . . ,P-1 , 


The  evaluation  of  the  constant  d  can  be  done  using  an  Edgeworth 
approximation  of  order  1//N  as  discussed  by  Arvesen  and  McCabe  (1975). 

But,  as  they  have  remarked,  this  may  be  a  formidable  problem  for  p  >_  4  or 
5.  So  we  resort  to  Monte  Carlo  technique.  The  steps  involved  are  described 
below. 

(1)  Generate  random  observations  Y-|,...,Y^  from  a  standard 
normal  distribution. 

(2)  Calculate  SSi ,  i  =  0,l,...,p-l. 

(3)  Form  the  ratio  A  =  min  SS,/SSn. 

l<i<p-l  1  u 

(4)  Repeat  steps  1  to  3  m  times  retaining  the  values  A-j ,  A^.-.-.A  . 

(5)  Denote  the  ordered  A^  by  A^-j  <_. .  .£  A|-mj. 

Then  the  estimate  of  d  is  d  =  AQr+i ] »  where  r  is  an  integer  such  that 
r/m  <_  (1-P*)  <  (r+1  )/m. 
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Based  on  the  experiences  of  McCabe  and  Arvesen  (1974)  with  their 
problem,  it  appears  that  m  =  1000  may  give  adequate  estimates. 

6.  IMPLEMENTATION  OF  THE  PROCEDURE 

The  procedure  ft  in  (3.1)  can  be  restated  as  follows. 


ft:  Exclude  any  model  if  the  corresponding  residual  sum  of  squares 
SSj  >_  d  SSq.  The  implementation  of  the  procedure  is  straightforward  since 

it  involves  only  the  evaluations  of  the  residual  sums  of  squares  of  all  the 
models.  However,  it  may  not  be  necessary  to  compute  the  residual  sums  of 
squares  for  all  the  reduced  models.  It  should  be  remembered  that  the  rejection 
of  any  model  implies  the  rejection  of  all  its  submodels.  For  example,  if 
we  have  p-1  =  4  predictor  variables.  We  first  consider  all  the  one  variable 
models.  Suppose  that  the  models  { x^ }  and  {x^}  are  selected  and  the  models 
{ X2 1  and  {x3>  are  rejected.  This  automatically  means  that  the  models 

{ X 1  ,  X£  )  ,  {X^,  X3  }  ,  {X-j,  X^  }  ,  {  X^ ,  X^  }  ,  { X3 ,  X^ }  ,  {X-j,  X^,  X3 )  ,  { X-j  ,  X^,  X4}  , 

{ x  1 ,  x3,  x4),  {x^,  x3,  x^}  and  {x^  X£,  x3,  x^}  are  selected.  It  leaves  only 
(x^,  x3>  to  be  considered  next. 
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teed  minimum  probability.  A  procedure  is  proposed  for  which  the  exact  evaluation 
of  the  probability  of  a  correct  decision  is  difficult;  however,  it  is  shown  that 
the  probability  requirement  can  be  met  for  sufficiently  large  sample  size.  Monte 
Carlo  evaluation  of  the  constant  associated  with  the  procedure  and  some  ways  to 
reduce  the  amount  of  computations  involved  in  the  implementation  of  the  procedure  - 


